Optional SIMD strlen by ncruces · Pull Request #586 · WebAssembly/wasi-libc

ncruces · 2025-06-11T15:11:22Z

This is a potential first step at upstreaming #580 piecemeal.

Chose strlen as it's already covered by tests. It's also one of the simplest (easiest to review) implementations.

Built and tested with:

CC=[PATH_TO]/wasi-sdk-25.0-x86_64-linux/bin/clang \
  EXTRA_CFLAGS="-O2 -DNDEBUG -msimd128 -mbulk-memory -D__wasilibc_simd_string" make
(cd test ; CC=[PATH_TO]/wasi-sdk-25.0-x86_64-linux/bin/clang make )

ncruces · 2025-06-12T16:36:01Z

That's odd. Do I need to guard that include? It worked with my clang, but apparently not here.

abrown · 2025-06-12T16:57:35Z

Yes, take a look at how Clang is set up here with main.yml; I don't see anything that would set up its include path.

abrown

@ncruces and I are talking offline about some of the CI issues; the easy answer for now is to just guard the extra wasm_simd.h header. To be honest, maybe some of this problem would go away if wasi-libc and wasi-sdk were merged (#427); the headers should be right there.

I will note that we probably want to check a few more corner cases here, especially since a quick search of the test directory doesn't reveal much testing of strlen (e.g., grep -Ir strlen test). I think the one concern several of us had when we discussed this is, "what happens when we run into the border of WebAssembly memory?" If the v128 loads were to "over-read" and result in a trap, that would be different behavior than the scalar version of this function. I think it would be good to add a test/src/misc/strlen.c that checks for edge cases like:

the last byte in memory is 0 but a v128 read crosses the end of memory; should not trap
the last byte in memory is not 0; should trap for both SIMD and scalar
from a string starting at byte 1, especially where the previous byte is 0 to test some of the alignment logic
from a string ending at an unaligned address; especially where the trailing bytes are some pattern of 0 and 1

Perhaps you already have some of these kinds of tests implemented elsewhere?

ncruces · 2025-06-12T18:16:30Z

@abrown, those cases are considered, and yes I have tests for them, just not in C (I do the setup in the host, which for me is in Go): exhaustive tests. I can try to port some of this.

The problem with testing "the last byte of memory" from inside C, is that I don't think I can force malloc to give me that memory. Any ideas?

Handling those cases is the whole purpose of aligning the pointer: an aligned load cannot trap, because the 16-bytes can't cross a page (at least until clang supports custom page sizes).

ncruces · 2025-06-12T18:35:52Z

Also, I don't think this one is something we can or should test:

“the last byte in memory is not 0; should trap for both SIMD and scalar”

This is undefined behaviour, so I don't think C promises a trap. Or in fact that it can promise a trap. As soon as clang figures out that it is undefined behaviour, all bets are off.

Also, I don't think there's a change of behaviour, but (to give you an idea of the problem here) I'm reasonably sure if the module allocates the full 4GB, both the existing strlen and my SIMD version both wrap around memory to address zero (and don't trap, because dereferencing zero is allowed) if the last address of the 4GB is not zero.

ncruces · 2025-06-13T12:15:56Z

Added tests. I have https://github.com/ncruces/wasi-libc/commit/42515886af4cf480845c0a5910cb7cc6649a534b to test this in CI, but it uses wasi-sdk and modifies your Makefile so I'm not sure how you wanna tackle that.

abrown · 2025-06-17T21:41:49Z

I don't think this one is something we can or should test

Yeah, I see what you mean. Additionally, I may not be remembering correctly, but I don't think we have a way to check that a test traps (?). I think it is all structured the other way around. So, yes, let's avoid that case.

ncruces · 2025-06-17T21:46:09Z

Yes, I don't think so. Basically, it seems any test that writes to stdout seems to be a failure. This surprised me, I thought exit code would matter.

Let me know if you want me to add anything like the CI commit I linked in #586 (comment).

abrown

I think this makes sense; let's add the doc comments and I will +1. We've identified the bigger problem here (which @ncruces shouldn't have to solve) which is that we need an easier way to test this kind of functionality in CI; right now this test (a good one, thanks!) only runs in with the scalar version of strlen. I've opened an issue to figure this out (#587), but in the meantime I would be willing to merge this, especially since it is behind flags, etc..

test/src/misc/strlen.c

abrown · 2025-06-17T22:34:27Z

Yes, I don't think so. Basically, it seems any test that writes to stdout seems to be a failure. This surprised me, I thought exit code would matter.

Let me know if you want me to add anything like the CI commit I linked in #586 (comment).

Oh, the exit code does matter:

wasi-libc/test/scripts/run-test.sh

Line 20 in 4720b34

[ $? -eq 0 ] || echo "Test failed" >> output.log

But, yeah, eventually we could and probably should relax the "no output" check:

wasi-libc/test/scripts/failed-tests.sh

Line 14 in 4720b34

FAILED=$(find $TESTDIR -type f -and -name *.log -and -not -size 0)

I can't remember exactly why that is there but my guess would be that libc-test only uses output in error cases (?). If you're interested in making that more robust, it likely wouldn't be too difficult to add expected.log files for tests that need this and then use that (if present) to compare against output.log.

There are a number of changes in this commit aimed at addressing WebAssembly#587 and making it easier to test WebAssembly#586 in CI. Notable changes here are: * The matrix of what to test is much different from before. One matrix entry now builds just one target and optionally tests that target. * The CI matrix ensures that wasi-libc builds on a variety of platforms, e.g. Windows/macOS/Linux as well as Linux aarch64. * The CI matrix has one entry for building with an older version of LLVM. This version was bumped from LLVM to LLVM 11 since LLVM is installed through `apt-get`, not through downloads any more. * On Linux LLVM/Clang are downloaded through `apt-get` instead of from llvm-project binaries to avoid dealing with `libtinfo5` and dependencies. * The CI matrix has a test job per-target. This can be expanded/shrunk as necessary but currently everything is tested with LLVM 16 (as before) and only on Linux (also as before). The test run is seqeunced to happen after the build of libc itself. * The CI matrix has split out V8 headless tests into their own job to avoid running multiple suites of tests in a single job. * Installation of LLVM is refactored to a separate action to reduce the noise in `main.yml`. * Setting `TARGET_TRIPLE` can now be done through environment variables as opposed to only through arguments to `make`. * Handling of `BULITINS_LIB` has improved. Previously the build target for `libc_so` would modify the compiler's resource directory and this is updated to use a custom directory in `OBJDIR`. * Arranging compiler-rt for tests is now done with `-resource-dir` instead of copying the directory into the system compiler's location. Overall it's the intention that no amount of testing is lost in this PR. The goal is to expand things out in such a way that it's much easier to add one-off tests of wasi-libc in various build configurations and such. The theory is that this is as "simple" as adding a new matrix entry, copied from previous ones, customized with various variables and environment variables to affect the build (e.g. `CFLAGS`). Closes WebAssembly#587

alexcrichton

I wanted to echo again thank you @ncruces for working on this, it's much appreciated!

For the opt-in, this is currently using __wasilibc_simd_string, but if I could bikeshed that a bit maybe this could be renamed to _WASI_SIMD? Using all-caps feels a bit more idiomatic with what wasi-libc does today with other user-provided #defines such as _WASI_EMULATED_MMAN and also scoping it to just "SIMD" instead of "SIMD_STRING" helps reuse this for anywhere else in wasilibc we might want to use simd too. (under the assumption that multiple layers of opt-in aren't necessary that is)

alexcrichton · 2025-06-18T22:08:49Z

libc-top-half/musl/src/string/strlen.c

+	// strlen must stop as soon as it finds the terminator.
+	// Aligning ensures loads beyond the terminator are safe.
+	uintptr_t align = (uintptr_t)s % sizeof(v128_t);
+	const v128_t *v = (v128_t *)(s - align);


I'm by no means an expert on C, but this looks like it's reading data prior to an allocated object which I believe is UB. This is UB at the C level, I think, although I realize there's no issue reading at the wasm-level.

Would it be possible to do the scalar loop up until s is 16-byte aligned?

Well, I can probably rephrase this a bit. I'm pretty certain this is UB to read data before an allocated pointer. This program:

#include <stdio.h> #include <stdlib.h> int main() { char *c = malloc(1); *c = 0; printf("%d\n", *(c - 1)); }

runs as:

$ clang foo.c -fsanitize=address && ./a.out ================================================================= ==2060611==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x50200000000f at pc 0x5a6c161867c3 bp 0x7fff90b4b5a0 sp 0x7fff90b4b598 READ of size 1 at 0x50200000000f thread T0 #0 0x5a6c161867c2 in main (/home/alex/code/wasi-libc/a.out+0x1047c2) (BuildId: b1ea958363f739949e3f7cab54dc5a56182d8397) #1 0x72399122a1c9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #2 0x72399122a28a in __libc_start_main csu/../csu/libc-start.c:360:3 #3 0x5a6c160ad344 in _start (/home/alex/code/wasi-libc/a.out+0x2b344) (BuildId: b1ea958363f739949e3f7cab54dc5a56182d8397) ...

Possible, yes. It kinda defeats the purpose, in both performance (speed) and code size (when optimizing for speed and size both).

Yes, reading before the buffer is undefined at the C level, but so is reading past the buffer, which, since the buffer is of an unknown size is impossible not to do: the existing SWAR code does it.

So, yes, probably the correct way to go, is to do this in assembly (as musl, glibc do). If that's the path you decide on I guess you could do worse than start with the code this currently generates.

Or you need to decide what kind of UB is acceptable at the C level. Does __may_alias__ fix it for the existing SWAR code? If so, using wasm_v128_load should allay your concerns. And if it doesn't, why not? It's not necessary, mind you (because the address is aligned), but it if it "hides" UB from the compiler, that's OK.

Note that the reason SWAR uses scalar loops before/after should be mostly because the SWAR equivalent to wasm_i8x16_bitmask is hard/slow, and not because of concerns about dereferencing a w that only partially covers the buffer.

Also, I've used emcc with sanitizers to build the following:

#include <limits.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <wasm_simd128.h> #define ALIGN (sizeof(size_t)) #define ONES ((size_t)-1/UCHAR_MAX) #define HIGHS (ONES * (UCHAR_MAX/2+1)) #define HASZERO(x) ((x)-ONES & ~(x) & HIGHS) size_t strlen2(const char *s) { #ifdef __wasm_simd128__ // strlen must stop as soon as it finds the terminator. // Aligning ensures loads beyond the terminator are safe. uintptr_t align = (uintptr_t)s % sizeof(v128_t); const v128_t *v = (v128_t *)(s - align); for (;;) { // Bitmask is slow on AArch64, all_true is much faster. if (!wasm_i8x16_all_true(*v)) { const v128_t cmp = wasm_i8x16_eq(*v, (v128_t){}); // Clear the bits corresponding to alignment (little-endian) // so we can count trailing zeros. int mask = wasm_i8x16_bitmask(cmp) >> align << align; // At least one bit will be set, unless we cleared them. // Knowing this helps the compiler. __builtin_assume(mask || align); // If the mask is zero because of alignment, // it's as if we didn't find anything. if (mask) { // Find the offset of the first one bit (little-endian). return (char *)v - s + __builtin_ctz(mask); } } align = 0; v++; } #endif const char *a = s; #ifdef __GNUC__ typedef size_t __attribute__((__may_alias__)) word; const word *w; for (; (uintptr_t)s % ALIGN; s++) if (!*s) return s-a; for (w = (const void *)s; !HASZERO(*w); w++); s = (const void *)w; #endif for (; *s; s++); return s-a; } int main() { char *c = malloc(1); c[0] = 0; printf("%lu\n", strlen2(c)); free(c); }

And sure enough, it does trip the sanitizer:

$ emcc -msimd128 main.c -o main.html --emrun -fsanitize=address && emrun main.html Now listening at http://0.0.0.0:6931/ Opening in existing browser session. ================================================================= ==42==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x136007b0 at pc 0x00000d20 bp 0x129574c0 sp 0x129574cc READ of size 16 at 0x136007b0 thread T0 #0 0x00000d20 in wasm-function[30]+0xd20 (this.program+0xd20) 0x136007b1 is located 0 bytes after 1-byte region [0x136007b0,0x136007b1) allocated by thread T0 here: #0 0x000191a5 in wasm-function[373]+0x191a5 (this.program+0x191a5) #1 0x00000f4c in wasm-function[31]+0xf4c (this.program+0xf4c) #2 0x000010dc in wasm-function[36]+0x10dc (this.program+0x10dc) #3 0x80000317 (JavaScript+0x317) #4 0x80001199 in callMain http://localhost:6931/main.js:4505:15 #5 0x800011c3 in doRun http://localhost:6931/main.js:4547:24 #6 0x800011ca (JavaScript+0x11ca)

The problem with using that as a benchmark is that building the current musl SWAR code, without SIMD, trips the sanitizer in the exact same way:

$ emcc main.c -o main.html --emrun -fsanitize=address && emrun main.html Now listening at http://0.0.0.0:6931/ Opening in existing browser session. ================================================================= ==42==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x136007b0 at pc 0x00000e80 bp 0x12957520 sp 0x1295752c READ of size 4 at 0x136007b0 thread T0 #0 0x00000e80 in wasm-function[30]+0xe80 (this.program+0xe80) 0x136007b1 is located 0 bytes after 1-byte region [0x136007b0,0x136007b1) allocated by thread T0 here: #0 0x000193bf in wasm-function[373]+0x193bf (this.program+0x193bf) #1 0x00001166 in wasm-function[31]+0x1166 (this.program+0x1166) #2 0x000012f6 in wasm-function[36]+0x12f6 (this.program+0x12f6) #3 0x80000317 (JavaScript+0x317) #4 0x80001199 in callMain http://localhost:6931/main.js:4505:15 #5 0x800011c3 in doRun http://localhost:6931/main.js:4547:24 #6 0x800011ca (JavaScript+0x11ca)

So, I'm not sure that's really a useful standard to aim for. None of the above run into issues with the undefined sanitizer.

Another possibility that may allay concerns might be to use *(volatile v128_t*)v for any loads.

To me UB is UB and should be avoided at all times, but then again I come from a background of mostly Rust-based development. I don't review musl code but I did take a look at this PR, and I realize it can be confusing when it feels like you're being held to different standards than the rest of the codebase.

I've not interacted or heard of __may_alias__ before myself. That may mean that the error flagged in asan with emcc is a false positive, but then again it could also mean that while __may_alias__ is well-intentioned but it's still UB.

Yes, reading before the buffer is undefined at the C level, but so is reading past the buffer, which, since the buffer is of an unknown size is impossible not to do: the existing SWAR code does it.

Good point! I agree reading past the end is UB too.

Or you need to decide what kind of UB is acceptable at the C level.

Well, this is sort of my background speaking, but IMO there is no amount of UB that is acceptable. Of course reasonable folks can have different stance as well though, I'm just one opinion.

If so, using wasm_v128_load should allay your concerns. And if it doesn't, why not?

...

Another possibility that may allay concerns might be to use (volatile v128_t)v for any loads.

I'm not sure! I'm pretty sure normal reads before/after an allocation are UB, but outside of that realm in the world of volatlie and compiler intrinsics such questions would probably need to be answered by someone with deeper compiler knowledge than I. I am not nor do I want to be an arbiter of whether or not an operation is UB for wasi-libc.

This is UB, all of it is; volatile, may_alias, using intrinsics, may prevent the compiler from acting on it, but that's it; in this case, volatile would work best, I think.

The compiler should also not act on it, if the function isn't inlined, because whether it is UB, depends on the pointer itself (e.g. it's not UB if the pointer points into the middle of a larger buffer). So this should be fine with separate compilation, which is why musl gets away with it.

And you hit nail on the head: if you're going to hold this code to an higher standard than musl, this is simply impossible to do. I may as well drop the whole thing, which is fine (but I'd rather avoid pouring more effort into it).

The only way then, is doing it in assembly, perhaps compiling something, then inspecting the assembly.

If there is still interest in this PR, my reading of the C standard convinces me that 261eadc avoids undefined behavior, replacing it with implementation-defined behavior. This means the implementation (clang) must define a behavior, and stick to it. Crucially, it avoids the compiler deciding it can do anything it wants.

The program from #586 (comment), may print something other than zero. When compiled with emcc -O3 -msimd128 it prints 8, because:

strlen2 is inlined, then

the compiler knows s points to the "start of an object" (it came from malloc),

the pointer math is undefined behavior, and

anything goes.

If you use it with an arbitrary pointer (rather than something straight from malloc) it does the right thing, because it has no way of knowing the aligned pointer is not part of the object.

Replacing the pointer math with the following closes the loophole:

const v128_t *v = (v128_t *)((uintptr_t)s - align);

ISO/IEC 9899:2024:

An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, may not be correctly aligned, may not point to an entity of the referenced type, and can produce an indeterminate representation when stored into an object. ⁵⁸⁾

Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result is not required to be in the range of values of any integer type.

⁵⁸⁾ The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.

Inline assembly may also provide an alternative way of laundering the pointer: https://nullprogram.com/blog/2025/04/19/

There are a number of changes in this commit aimed at addressing #587 and making it easier to test #586 in CI. Notable changes here are: * The matrix of what to test is much different from before. One matrix entry now builds just one target and optionally tests that target. * The CI matrix ensures that wasi-libc builds on a variety of platforms, e.g. Windows/macOS/Linux as well as Linux aarch64. * The CI matrix has one entry for building with an older version of LLVM. This version was bumped from LLVM to LLVM 11 since LLVM is installed through `apt-get`, not through downloads any more. * On Linux LLVM/Clang are downloaded through `apt-get` instead of from llvm-project binaries to avoid dealing with `libtinfo5` and dependencies. * The CI matrix has a test job per-target. This can be expanded/shrunk as necessary but currently everything is tested with LLVM 16 (as before) and only on Linux (also as before). The test run is seqeunced to happen after the build of libc itself. * The CI matrix has split out V8 headless tests into their own job to avoid running multiple suites of tests in a single job. * Installation of LLVM is refactored to a separate action to reduce the noise in `main.yml`. * Setting `TARGET_TRIPLE` can now be done through environment variables as opposed to only through arguments to `make`. * Handling of `BULITINS_LIB` has improved. Previously the build target for `libc_so` would modify the compiler's resource directory and this is updated to use a custom directory in `OBJDIR`. * Arranging compiler-rt for tests is now done with `-resource-dir` instead of copying the directory into the system compiler's location. Overall it's the intention that no amount of testing is lost in this PR. The goal is to expand things out in such a way that it's much easier to add one-off tests of wasi-libc in various build configurations and such. The theory is that this is as "simple" as adding a new matrix entry, copied from previous ones, customized with various variables and environment variables to affect the build (e.g. `CFLAGS`). Closes #587

ncruces · 2025-06-25T15:18:08Z

Added a CI job.

.github/workflows/main.yml

abrown

@ncruces, thank you so much for working through all of this! I just got done discussing this with @alexcrichton and @sunfishcode and they also wanted to relay their thanks for your patience, e.g., waiting for changes to CI to accommodate new testing. @alexcrichton plans to resolve the UB discussion with a follow-on PR that using a v128.load in inline assembly; we can use that as a pattern for resolving this same kind of problem in future SIMD functions you bring over. In the meantime, for everyone else, note that this new behavior is gated behind __wasm_simd128__ and __wasilibc_simd_string, so one would have to build a custom wasi-libc to even see this new implementation.

This is a potential first step at upstreaming WebAssembly#580 piecemeal. Chose `strlen` as it's already covered by tests. It's also one of the simplest (easiest to review) implementations. Built and tested with: ```sh CC=[PATH_TO]/wasi-sdk-25.0-x86_64-linux/bin/clang \ EXTRA_CFLAGS="-O2 -DNDEBUG -msimd128 -mbulk-memory -D__wasilibc_simd_string" make (cd test ; CC=[PATH_TO]/wasi-sdk-25.0-x86_64-linux/bin/clang make ) ```

ncruces · 2025-07-01T03:32:16Z

I hope using inline asm for the load doesn't prevent the compiler from realising it's worth it to unroll the first iteration of the loop, since after that align == 0, and both the fast and slow paths are significantly simpler.

abrown · 2025-07-01T04:07:45Z

Good point. Another thing we discussed is how to benchmark this; at some point in one of these PRs (or somewhere) can you explain how others could replicate the numbers you observed?

ncruces · 2025-07-01T07:53:16Z

My benchmarks were similarly in Go and wazero specific so inapplicable.

But just as we did for tests, if you point me to a folder, I can easily create a WASI command that collects some numbers from within C, and runs on any WASI runtime.

The “infrastructure” (folders, build system, conventions) is more complicated to me than the test itself.

And I'm as interested as anybody in seeing numbers from other runtimes.

ncruces · 2025-07-01T11:43:20Z

Also, I just found __builtin_align_down, if you find this clearer, I can update this and subsequent PRs:

const v128_t *v = (v128_t *)__builtin_align_down((uintptr_t)s, sizeof(v128_t));
uintptr_t align = (uintptr_t)s - (uintptr_t)v;

It generates the exact same Wasm, and similarly requires going through uintptr_t to get specified behavior, so it's just a matter of possibly clearer intent.

When aligning pointers up or down, the resulting value must be within the same underlying allocation or one past the end. This means that arbitrary integer values stored in pointer-type variables must not be passed to these builtins. For those use cases, the builtins can still be used, but the operation must be performed on the pointer cast to uintptr_t.

ncruces · 2025-07-01T15:48:36Z

A benchmark like in this gist, gives me an improvement of around 3x for strlen and 2x for memchr for various lengths.

This seems less impressive than my previous benchmarks because this one is using mixed input lengths, whereas my benchmarks focused on larger inputs; doing that instead gives 7x and 5x respectively.

The trouble is where to drop this. Just putting it in the test/src directory is a failure, because anything that writes to stdout or stderr is considered a failure.

sunfishcode · 2025-07-01T17:26:59Z

Also, I just found __builtin_align_down, if you find this clearer, I can update this and subsequent PRs:
const v128_t *v = (v128_t *)__builtin_align_down((uintptr_t)s, sizeof(v128_t));
uintptr_t align = (uintptr_t)s - (uintptr_t)v;
It generates the exact same Wasm, and similarly requires going through uintptr_t to get specified behavior, so it's just a matter of possibly clearer intent.

The issue isn't so much how the pointer is computed, but the fact that the load ultimately accesses memory beyond the end of an allocation. No amount of casting the address to/from uintptr_t or __builtin_align_down or any other pointer computation can make it ok to access memory beyond the end of an allocation, even if it's just a read, and even if you don't ultimately depend on the values of those bytes.

The reason we're concerned about this is that LLVM follows the C rules, and the C rules are what say that you can't access memory beyond the end of an allocation. It's unfortunate that musl already does this, but at the moment we are protected somewhat by musl being defined in its own translation unit and not typically available for inlining. But since these SIMD routines are defined in a header, that exposes them to many more potential optimizations, which come with more risk of LLVM seeing partially-out-of-bounds loads and assuming they are fully UB.

ncruces · 2025-07-01T18:31:31Z

They are not "defined in an header", not in the merged implementation.

They were/are in the version I was/am using.

But the implementation is, in my reading of the standard, implementation defined behavior, which is actually defined (by clang) to do what it's supposed to do.

This commit is a refinement of WebAssembly#586 to use inline assembly to perform vector loads instead of using a C-defined load. This is done to avoid UB in LLVM where C cannot read either before or after an allocation. When `strlen` is not inlined, as it currently isn't, then there's not really any reasonable path that a compiler could prove that a load was out-of-bounds so this is issue is unlikely in practice, but it nevertheless is still UB. In the future the eventual goal is to move these SIMD routines into header files to avoid needing multiple builds of libc itself, and in such a situation inlining is indeed possible and a compiler would be capable of much more easily seeing the UB which could cause problems. Inline assembly unfortunately doesn't work with vector output parameters on Clang 19 and Clang 20 due to an ICE. This was fixed in llvm/llvm-project#146574 for Clang 21, but it means that the SIMD routines are now excluded with Clang 19 and Clang 20 to avoid compilation errors there.

alexcrichton · 2025-07-02T17:14:22Z

@alexcrichton plans to resolve the UB discussion with a follow-on PR that using a v128.load in inline assembly

now present at #593

sunfishcode · 2025-07-02T21:28:59Z

@ncruces Ah, my mistake then; there has been some discussion about implementing these in header files, and I missed that this PR isn't doing that.

This commit is a refinement of #586 to use inline assembly to perform vector loads instead of using a C-defined load. This is done to avoid UB in LLVM where C cannot read either before or after an allocation. When `strlen` is not inlined, as it currently isn't, then there's not really any reasonable path that a compiler could prove that a load was out-of-bounds so this is issue is unlikely in practice, but it nevertheless is still UB. In the future the eventual goal is to move these SIMD routines into header files to avoid needing multiple builds of libc itself, and in such a situation inlining is indeed possible and a compiler would be capable of much more easily seeing the UB which could cause problems. Inline assembly unfortunately doesn't work with vector output parameters on Clang 19 and Clang 20 due to an ICE. This was fixed in llvm/llvm-project#146574 for Clang 21, but it means that the SIMD routines are now excluded with Clang 19 and Clang 20 to avoid compilation errors there.

Continuing #580, followup to #586. Chose `memchr` because it's somewhat similar to `strlen`, but also because it is the basis for `strnlen` (and in that capacity, for `strndup` and `strlcat`) and is also used by `strstr`, `fnmatch`.

Continuing #580, implements `strspn` and `strcspn`. This one follows the same general structure as #586, #592 and #594, but uses a somewhat more complicated algorithm, described [here](http://0x80.pl/notesen/2018-10-18-simd-byte-lookup.html). I used the Geoff Langdale alternative implementation (the tweet as since disappeared) which is correctly described there but has a subtle bug in the implementation: WojciechMula/simd-byte-lookup#2 Since the complexity needed for `__wasm_v128_bitmap256_t` is shared for both `strspn` and `strcspn`, I moved the implementation to a common file, when SIMD is used. The tests follow a similar structure as the previous ones, and cover the bug, which I was found through fuzzing.

abrown reviewed Jun 12, 2025

View reviewed changes

abrown mentioned this pull request Jun 17, 2025

ci: improve the testing of multiple configurations #587

Closed

abrown reviewed Jun 17, 2025

View reviewed changes

test/src/misc/strlen.c Show resolved Hide resolved

test/src/misc/strlen.c Show resolved Hide resolved

test/src/misc/strlen.c Show resolved Hide resolved

alexcrichton mentioned this pull request Jun 18, 2025

Refactor how CI works in this repository #588

Merged

alexcrichton reviewed Jun 18, 2025

View reviewed changes

ncruces force-pushed the simd branch 2 times, most recently from f3bd255 to 261eadc Compare June 25, 2025 14:40

abrown reviewed Jun 25, 2025

View reviewed changes

.github/workflows/main.yml Outdated Show resolved Hide resolved

ncruces requested review from abrown and alexcrichton June 26, 2025 22:12

ncruces added 9 commits June 30, 2025 13:39

Optional SIMD strlen

bbe4122

Consistent formatting.

28d54cd

Comment.

fe079cd

Pass CI.

583021e

Add test.

b699b1c

Avoid constant.

85bf59b

Test crosses page boundary.

13d318b

Comments.

899753f

Typo.

86ffe52

ncruces added 4 commits June 30, 2025 13:39

Avoid undefined behavior.

f67590b

CI test.

b2158bd

Fix.

266d6a1

Workflow name, comments.

7392b47

ncruces force-pushed the simd branch 2 times, most recently from b666430 to 7392b47 Compare June 30, 2025 12:45

abrown approved these changes Jun 30, 2025

View reviewed changes

abrown merged commit 553305f into WebAssembly:main Jun 30, 2025
15 checks passed

ncruces mentioned this pull request Jul 1, 2025

Optional SIMD memchr #592

Merged

ncruces mentioned this pull request Jul 2, 2025

Using SIMD for string.h #580

Open

alexcrichton mentioned this pull request Jul 2, 2025

Use inline assembly in strlen for vector loads #593

Merged

ncruces mentioned this pull request Jul 17, 2025

Optional SIMD str(c)spn #597

Merged

Conversation

ncruces commented Jun 11, 2025

Uh oh!

ncruces commented Jun 12, 2025

Uh oh!

abrown commented Jun 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abrown left a comment

Choose a reason for hiding this comment

Uh oh!

ncruces commented Jun 12, 2025

Uh oh!

ncruces commented Jun 12, 2025

Uh oh!

ncruces commented Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abrown commented Jun 17, 2025

Uh oh!

ncruces commented Jun 17, 2025

Uh oh!

abrown left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

abrown commented Jun 17, 2025

Uh oh!

alexcrichton left a comment

Choose a reason for hiding this comment

Uh oh!

alexcrichton Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

alexcrichton Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

ncruces Jun 19, 2025

Choose a reason for hiding this comment

Uh oh!

ncruces Jun 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ncruces Jun 20, 2025

Choose a reason for hiding this comment

Uh oh!

alexcrichton Jun 20, 2025

Choose a reason for hiding this comment

Uh oh!

ncruces Jun 20, 2025

Choose a reason for hiding this comment

Uh oh!

ncruces Jun 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ncruces Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ncruces commented Jun 25, 2025

Uh oh!

Uh oh!

abrown left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ncruces commented Jul 1, 2025

Uh oh!

abrown commented Jul 1, 2025

Uh oh!

ncruces commented Jul 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ncruces commented Jul 1, 2025

Uh oh!

ncruces commented Jul 1, 2025

abrown commented Jun 12, 2025 •

edited

Loading

ncruces commented Jun 13, 2025 •

edited

Loading

ncruces Jun 20, 2025 •

edited

Loading

ncruces Jun 25, 2025 •

edited

Loading

ncruces Jun 27, 2025 •

edited

Loading

ncruces commented Jul 1, 2025 •

edited

Loading

ncruces commented Jul 1, 2025 •

edited

Loading